This notebook is the first in a sequence of steps for running machine learning on the cloud. This step covers data preparation and preprocessing, and mirrors the equivalent portions of the local notebook.
In [1]:
import google.datalab as datalab
import google.datalab.ml as ml
import mltoolbox.regression.dnn as regression
import os
The storage bucket we create will, by default, be named using the project id.
In [2]:
storage_bucket = 'gs://' + datalab.Context.default().project_id + '-datalab-workspace/'
storage_region = 'us-central1'
workspace_path = os.path.join(storage_bucket, 'census')
# We will rely on outputs from data preparation steps in the previous notebook.
local_workspace_path = '/content/datalab/workspace/census'
In [ ]:
!gsutil mb -c regional -l {storage_region} {storage_bucket}
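The bucket only needs to be created once; if you re-run this notebook, the mb command will report that the bucket already exists, which is safe to ignore. If you prefer, you can check for the bucket first (an optional sanity check, not required for the workflow):

!gsutil ls -b {storage_bucket}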
NOTE: If you have previously run this notebook, and want to start from scratch, then run the next cell to delete previous outputs.
In [ ]:
!gsutil -m rm -rf {workspace_path}
To get started, we will copy the data into this workspace from the local workspace created in the previous notebook.
Generally, in your own work, you will have existing data that you may or may not need to copy around, depending on where it currently lives.
In [5]:
!gsutil -q cp {local_workspace_path}/data/train.csv {workspace_path}/data/train.csv
!gsutil -q cp {local_workspace_path}/data/eval.csv {workspace_path}/data/eval.csv
!gsutil -q cp {local_workspace_path}/data/schema.json {workspace_path}/data/schema.json
!gsutil ls -r {workspace_path}
In [6]:
train_data_path = os.path.join(workspace_path, 'data/train.csv')
eval_data_path = os.path.join(workspace_path, 'data/eval.csv')
schema_path = os.path.join(workspace_path, 'data/schema.json')
train_data = ml.CsvDataSet(file_pattern=train_data_path, schema_file=schema_path)
eval_data = ml.CsvDataSet(file_pattern=eval_data_path, schema_file=schema_path)
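If you want to double-check the schema that both datasets will use (column names and types), you can print the schema file directly. This is just a quick sanity check:

!gsutil cat {schema_path}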
When building a model, a number of pieces of information about the training data are required - for example, the list of entries (the vocabulary) of a categorical/discrete column, or aggregate statistics such as the min and max of numerical columns. Gathering these requires a full pass over the training data; it is usually done once, and only needs to be repeated if you change the schema in a future iteration.
On the cloud, this analysis is done with BigQuery, which references the CSV data in Cloud Storage as external data sources. The output of the analysis is written back to storage.
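To build intuition for what the analysis produces, here is a minimal local sketch of the same idea using pandas. The column names are hypothetical, and the real analysis is handled for you by BigQuery in the analyze() call below:

import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    'age': [25, 38, 52, 41],                             # numerical column
    'occupation': ['sales', 'tech', 'sales', 'admin'],   # categorical column
})

# Aggregate statistics for a numerical column (min/max, etc.).
numeric_stats = {'min': df['age'].min(), 'max': df['age'].max()}

# Vocabulary (list of distinct entries) for a categorical column.
vocab = sorted(df['occupation'].unique())

print(numeric_stats, vocab)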
In the analyze() call below, notice the use of cloud=True to move the data analysis from running locally to running in the cloud.
In [7]:
analysis_path = os.path.join(workspace_path, 'analysis')
regression.analyze(dataset=train_data, output_dir=analysis_path, cloud=True)
As in the local notebook, the output of analysis is a stats file containing statistics for the numerical columns, and a vocabulary file for each categorical column.
In [8]:
!gsutil ls {analysis_path}
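Each categorical column gets its own vocabulary file in the analysis output. Assuming the files use a vocab_ prefix (the exact naming may differ across toolbox versions), you can list them with a wildcard and then cat the one you are interested in:

!gsutil ls {analysis_path}/vocab_*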
Let's inspect one of the files - in particular the numerical analysis, since it also tells us some interesting statistics about the income column, the value we want to predict.
In [9]:
!gsutil cat {analysis_path}/stats.json
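If you would rather work with these statistics in Python than read raw JSON, you can copy the file down and load it into a dictionary. Treat this as a quick sketch; the exact keys in stats.json depend on the toolbox version:

import json

# Copy the stats file to a local temp path and load it.
!gsutil -q cp {analysis_path}/stats.json /tmp/stats.json
with open('/tmp/stats.json') as f:
    stats = json.load(f)

print(json.dumps(stats, indent=2))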
This notebook completed the first steps of our machine learning workflow: data preparation and analysis. The data and the analysis outputs will be used to train a model, which is covered in the next notebook.